Multiclass linear classifier:
How we can interpret:
Limitations:
Use softmax function to convert scores to probabilities:
$$ s = f(x,W) \\ P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}} $$

Steps:
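A minimal NumPy sketch of this score-to-probability conversion (the scores are hypothetical; subtracting the max is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax(s):
    # Subtract the max for numerical stability (does not change the result).
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([3.2, 5.1, -1.7])   # hypothetical class scores s = f(x, W)
probs = softmax(scores)
# probs sums to 1; the largest score gets the largest probability
```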
Example of multiclass SVM loss:
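One possible sketch with hypothetical scores (margin fixed at 1, the common choice):

```python
import numpy as np

def svm_loss(scores, y):
    # Each wrong class pays the amount by which it violates the margin
    # scores[y] - 1; correct class contributes no loss.
    margins = np.maximum(0, scores - scores[y] + 1.0)
    margins[y] = 0.0
    return margins.sum()

scores = np.array([3.2, 5.1, -1.7])  # hypothetical scores; correct class y = 0
loss = svm_loss(scores, y=0)         # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```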
Takeaways:
Example of how this is calculated with image class prediction:
Cross-entropy and MLE:
Example using the image classification task:
Takeaways:
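The cross-entropy loss can be sketched in NumPy (hypothetical scores; minimizing $-\log P(Y=y|X=x)$ is exactly the MLE view above):

```python
import numpy as np

def cross_entropy(scores, y):
    # -log P(Y = y | X = x) under the softmax distribution.
    shifted = scores - scores.max()           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

scores = np.array([3.2, 5.1, -1.7])  # hypothetical scores; correct class y = 0
loss = cross_entropy(scores, y=0)
```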
Regularization expresses a preference for simpler models over complex ones. It is applied as an extra term in the loss function:
L1 regularization adds the L1 norm of the weight vector to the data loss: $$ L_i = \|y_i-Wx_i\|^2+\lambda\|W\|_1 $$
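A sketch of adding a weight penalty to a data loss (the weights, strength $\lambda$, and data-loss value are hypothetical):

```python
import numpy as np

def regularized_loss(data_loss, W, lam=0.1, norm="l1"):
    # Add a weight penalty so that, between models with equal data loss,
    # the one with smaller weights is preferred.
    if norm == "l1":
        penalty = np.abs(W).sum()        # L1: encourages sparse weights
    else:
        penalty = (W ** 2).sum()         # L2: encourages small, spread-out weights
    return data_loss + lam * penalty

W = np.array([[1.0, -2.0], [0.0, 3.0]])               # hypothetical weights
total = regularized_loss(0.5, W, lam=0.1, norm="l1")  # 0.5 + 0.1 * 6.0 = 1.1
```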
Given a scalar $s \in \mathbb{R}^1$ and vector $v \in \mathbb{R}^m$:
Q: Given 2 vectors $v_1 \in \mathbb{R}^m$ and $v_2 \in \mathbb{R}^n$, what's the size of $\frac{\partial v_1}{\partial v_2}$?
Q: Given a scalar and a matrix, what's the size of $\frac{\partial s}{\partial M}$?
Q: What is the size of $\frac{\partial L}{\partial W}$?
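These shape rules can be sketched with hypothetical dimensions $m = 3$, $n = 4$ (the arrays only stand in for the derivative objects; their values are irrelevant):

```python
import numpy as np

m, n = 3, 4              # hypothetical dimensions
M = np.zeros((m, n))     # a matrix in R^{m x n}

# d(scalar)/d(vector in R^m): one partial per component -> shape (m,)
ds_dv = np.zeros(m)
# d(v1 in R^m)/d(v2 in R^n): Jacobian with one row per output component
# and one column per input component -> shape (m, n)
dv1_dv2 = np.zeros((m, n))
# d(scalar)/d(matrix): same shape as the matrix itself, which is why
# dL/dW always has the same shape as W
ds_dM = np.zeros_like(M)
```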
Takeaway:
What makes it different:
What is deep learning:
Features are engineered in traditional ML:
Features are automatically extracted in deep learning:
Example of features for image detection:
Representation of data is done through building up a network of simple functions into a complex network.
"End-to-end": Learning is applied to entire spectrum, from raw data -> feature extraction -> classification.
Algorithm:
Applying batch gradient descent:
Convergence notes:
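A minimal sketch of batch gradient descent on a least-squares problem (the data, learning rate, and iteration count are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # hypothetical data
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless targets for illustration

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    # Full-batch gradient of L = (1/N) * sum_i (x_i . w - y_i)^2
    grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= lr * grad                       # step opposite the gradient
```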
How to compute $\frac{\partial L}{\partial W_i}$?
Derivation of update rule using squared loss:
Taking the partial derivative of this summation with respect to $w_j$ makes most of the terms vanish: when $i \neq j$, the term does not involve $w_j$, so its derivative is zero.
(Some more context on getting the partial derivative of update rule above)
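A sketch of that derivation, assuming squared loss on a single example with prediction $\hat{y} = \sum_j w_j x_j$:

$$ L = \frac{1}{2}\Big(y - \sum_j w_j x_j\Big)^2 \\ \frac{\partial L}{\partial w_j} = -\Big(y - \sum_k w_k x_k\Big)x_j = -(y-\hat{y})\,x_j $$

which gives the update rule $w_j \leftarrow w_j + \eta\,(y - \hat{y})\,x_j$.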
Update rule once we add non-linearity (sigmoid) - Gets more complex:
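With the sigmoid, the prediction becomes $\hat{y} = \sigma(z)$ with $z = \sum_j w_j x_j$, and the chain rule adds the factor $\sigma'(z) = \sigma(z)(1-\sigma(z))$ (a sketch assuming the same squared loss):

$$ \frac{\partial L}{\partial w_j} = -(y - \hat{y})\,\hat{y}\,(1 - \hat{y})\,x_j $$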
Manual differentiation can get messy. We can decompose the complicated function into modular sub-blocks.
Key ideas:
Distributed representation: Toy example
(Think of this as another view in combination with the linear algebra view)
Output can be modulated by a non-linear function (e.g. sigmoid)
Connecting Many Neurons
Terms:
The magic of NN is that we can stack multiple layers together:
From a linear algebra view, a 2-layer NN corresponds to adding another weight matrix:
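A sketch of that view with hypothetical dimensions (input in $\mathbb{R}^4$, hidden layer of size 5, 3 output classes):

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=4)          # input in R^4
W1 = rng.normal(size=(5, 4))     # first layer: R^4 -> R^5
W2 = rng.normal(size=(3, 5))     # second layer: R^5 -> R^3

h = np.maximum(0, W1 @ x)        # hidden layer with a non-linearity (ReLU here)
s = W2 @ h                       # class scores
# Without the non-linearity, W2 @ (W1 @ x) collapses to a single linear
# map (W2 @ W1) @ x, so stacking layers would gain nothing.
```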
We can build a deeper network by adding more layers.
Adding even more layers:
Problem:
Answer:
Example representation of a function as a graph:
Use matrix calculus to get derivatives of local gradients:
Computing the gradients of loss:
Summary:
Key Idea: Backpropagation is the application of gradient descent to a computation graph via the chain rule.
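A sketch on a tiny hypothetical graph $f(x, y, z) = (x + y)\,z$: the forward pass computes values node by node, and the backward pass applies the chain rule in reverse:

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0    # hypothetical inputs
q = x + y                    # intermediate node, q = 3
f = q * z                    # output, f = -12

# Backward pass: chain rule, starting from df/df = 1
df_dq = z                    # d(q*z)/dq = z = -4
df_dz = q                    # d(q*z)/dz = q = 3
df_dx = df_dq * 1.0          # dq/dx = 1, so df/dx = df/dq * dq/dx = -4
df_dy = df_dq * 1.0          # dq/dy = 1, so df/dy = -4
```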
Key Idea:
Computation as a Graph:
Automatic Differentiation: A family of algorithms for implementing chain-rule on computation graphs.
Example Computation Graph:
Partial derivatives from $a_3$ upstream:
Notes:
Different operations have different effects on the gradient.
Key Idea: If gradients do not flow backwards properly, learning slows or stops
Key Ideas:
Key ideas:
Key ideas:
In modern frameworks, gradients are computed automatically (e.g., by calling backward()); we no longer need to define forward/backward functions by hand like in the past. The power of auto-diff stems from the idea of differentiable programming:
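A toy sketch of what such a framework does under the hood: each operation records its inputs, and backward() replays the graph in reverse, applying the chain rule (the `Value` class and its API are invented for illustration, loosely in the spirit of real frameworks):

```python
class Value:
    """Minimal scalar autodiff node: records how it was computed so that
    backward() can apply the chain rule over the graph."""
    def __init__(self, data, parents=(), backward_fn=lambda: None):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = backward_fn

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad                # d(a+b)/da = 1
            other.grad += out.grad               # d(a+b)/db = 1
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Only the forward expression is written; gradients come for free.
x, y, z = Value(-2.0), Value(5.0), Value(-4.0)
f = (x + y) * z
f.backward()   # x.grad = -4.0, y.grad = -4.0, z.grad = 3.0
```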
Components:
Key Idea: A machine learning function (input -> model -> loss) is also a computational graph
Key Idea: All we need to do is define the forward function; the backward pass is derived automatically.
Key Idea: Chain rule can be computed via a series of operations on scalars, vectors, and matrices.
Example of logistic regression:
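A sketch with a hypothetical single example: forward pass through $\sigma(w \cdot x)$ and cross-entropy loss, then the gradient, which the chain rule collapses to $(\hat{y} - y)\,x$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical single example.
x = np.array([1.0, 2.0, -1.0])
y = 1.0
w = np.array([0.5, -0.3, 0.8])

# Forward: input -> model -> loss
z = w @ x                          # linear score
y_hat = sigmoid(z)                 # predicted probability
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Backward: the chain rule collapses to a simple expression
dloss_dw = (y_hat - y) * x
```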
Example of input/weights/output dimensions of a fully connected layer:
Sizes of Jacobians (gradients):
ReLU is a substitute for the sigmoid.
(ReLU and other activation functions):
Jacobian of ReLU
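Each ReLU output depends only on its own input, so the Jacobian is diagonal, with 1 where the input is positive and 0 elsewhere. A sketch with a hypothetical input:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_jacobian(x):
    # Diagonal matrix with entry 1 where x > 0, and 0 elsewhere.
    return np.diag((x > 0).astype(float))

x = np.array([2.0, -1.0, 0.5])
J = relu_jacobian(x)
# In practice the full matrix is never formed: the backward pass just
# multiplies the upstream gradient elementwise by the 0/1 mask.
upstream = np.array([1.0, 1.0, 1.0])
downstream = J @ upstream          # equals upstream * (x > 0)
```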
Key Ideas:
Aside: Why better gradient flow than sigmoid? (ChatGPT)
Rectified Linear Units (ReLU) and sigmoid activation functions behave differently in terms of gradient flow during the backpropagation process in neural networks, which can impact training efficiency. ReLU tends to provide better gradient flow than sigmoid for a few reasons:
Vanishing Gradient Problem: Sigmoid activation functions have a range between 0 and 1, which causes gradients to become very small when the input values are far from 0. This is particularly problematic during backpropagation because small gradients can lead to slow convergence or even complete stagnation of learning. ReLU, on the other hand, has a constant gradient of 1 for positive inputs, which can help mitigate the vanishing gradient problem and allow for faster learning.
Non-linearity and Sparsity: While sigmoid provides non-linearity to the network, ReLU introduces a sparsity aspect. This is because ReLU outputs 0 for negative inputs, effectively deactivating the neuron. This sparsity can make the network more efficient by reducing the number of active neurons and simplifying the representation of data, which can enhance gradient flow through the network.
Efficient Computation: ReLU is computationally more efficient than sigmoid. The sigmoid function requires exponentiation and division operations, which can be more costly in terms of computation compared to the simple thresholding operation of ReLU. This computational efficiency can contribute to faster training times.
Avoiding Saturation: Sigmoid saturates to either 0 or 1 when its inputs are very large or very small, causing the gradients to be close to zero. In such cases, the network's weights don't update effectively, slowing down learning. ReLU does not saturate for positive inputs, allowing gradients to flow more effectively through the network.
Initialization: Initialization techniques, like He initialization, have been specifically designed for ReLU activations. These initialization methods help prevent gradients from becoming too small during the early stages of training, promoting better gradient flow.
However, it's important to note that ReLU isn't without its drawbacks. It can suffer from a problem known as the "dying ReLU" problem, where a large portion of the neurons can become inactive and never recover during training. This issue has led to the development of variants like Leaky ReLU, Parametric ReLU, and Exponential Linear Units (ELU) to address the dying ReLU problem while retaining the benefits of improved gradient flow.
In summary, ReLU generally provides better gradient flow compared to sigmoid due to its sparsity, non-saturation, computational efficiency, and avoidance of the vanishing gradient problem. These factors collectively contribute to faster and more effective training of neural networks.
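The vanishing-gradient argument can be checked numerically: $\sigma'(z) = \sigma(z)(1-\sigma(z))$ peaks at 0.25, so stacking layers multiplies small factors together, while an active ReLU unit passes the gradient through with factor 1. A sketch with a hypothetical depth and pre-activation, ignoring the weight matrices:

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)           # peaks at 0.25 when z = 0

# Product of activation derivatives across 10 layers, assuming a
# hypothetical pre-activation z = 2 at every layer.
depth = 10
sig_flow = sigmoid_grad(2.0) ** depth   # ~0.105^10: vanishes quickly
relu_flow = 1.0 ** depth                # an active ReLU passes the gradient intact
```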